Method for rapidly searching elements or attributes or for rapidly filtering fragments in binary representations of structured, for example, XML-based documents

ABSTRACT

A method serves to encode textual paths for indexing and querying structured, for example XML-based, documents and serves to carry out an improved filtering of XML documents represented in binary form. A development of the method results in all indices being identical even when polymorphism is used. When these textual paths are stored for indexing or querying, only a smaller volume of data has to be stored or transmitted. This data can consequently also be compared more rapidly during a query since the volume of data to be compared is smaller.

FIELD OF TECHNOLOGY

The invention relates to methods whereby structured documents, for example XML-based or SGML-based documents, are queried based on textual path expressions. Textual paths are, for example, context paths as described in [1] for instance, or textual path details as specified in [2] for instance, for indexing and querying structured, for example XML-based, documents.

BACKGROUND

A system is known from [3] whereby textual paths are used for indexing the contents of an XML document. Absolute paths and partial paths to each element in a document are here stored in, for instance, a hash table. These elements are then referenced based on the storage address in the stored document.

A query language is furthermore known from [4] which is able to formulate queries, to a database for example, based on textual path expressions.

The object of the invention is to disclose methods for searching elements or for filtering fragments in binary representations of structured, for example XML-based, documents.

SUMMARY

The invention relates essentially to a method for encoding textual paths for indexing and querying structured, for example XML-based, documents and for the improved filtering of XML documents represented in binary form. A result of applying the method is that the indices will also be identical if polymorphism is used. Only a small volume of data has to be stored or transmitted when said textual paths for indexing or querying are stored. This data can consequently also be compared more quickly during a query as the volume of data being compared is smaller.

The invention is explained in greater detail below with the aid of exemplary embodiments shown in the drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A to 1C show the structure of an encoded path, of a lossy encoded partial path, and of a loss-free partial path, and

FIGS. 2A and 2B are a graphical representation of an absolute path and of a partial path.

DETAILED DESCRIPTION

As described at the beginning, textual paths can be used for indexing the contents of an XML-based document. According to the invention, these paths are encoded, which is to say in a manner similar to that indicated in [1].

In each case in schematic form, FIG. 1A shows encoding according to the invention of an absolute path, FIG. 1B shows the encoding of a lossy partial path, and FIG. 1C shows encoding according to the invention of a loss-free partial path.

In order to distinguish these three types of paths, a path type PT is signaled with two bits at the start of each of the three codes shown by way of example.
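Purely by way of illustration, the two-bit path type prefix can be sketched in Python as follows. The concrete bit values assigned to the three path types, the names PathType, with_path_type and read_path_type, and the representation of bit patterns as strings of '0' and '1' are assumptions made for this sketch and are not taken from the figures.

```python
from enum import Enum
from typing import Tuple


class PathType(Enum):
    # Hypothetical two-bit values; the actual assignment is an assumption.
    ABSOLUTE = "00"
    LOSSY_PARTIAL = "01"
    LOSSLESS_PARTIAL = "10"


def with_path_type(path_type: PathType, payload_bits: str) -> str:
    """Prefix an already encoded payload with the two-bit path type PT."""
    if set(payload_bits) - {"0", "1"}:
        raise ValueError("payload must consist of '0' and '1' only")
    return path_type.value + payload_bits


def read_path_type(code_bits: str) -> Tuple[PathType, str]:
    """Split a code into its path type PT and the remaining payload."""
    if len(code_bits) < 2:
        raise ValueError("a code must start with a two-bit path type")
    return PathType(code_bits[:2]), code_bits[2:]
```

Because the prefix has a fixed length, a reader of the code can decide from the first two bits alone which of the three layouts follows.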

If a path proceeding from a root node of an underlying data structure is possible for indexing, encoding, as shown in FIG. 1A, can take place simply by specifying the path type PT as an absolute path, followed by the absolute path AP. FIG. 2A shows such an absolute path AbsP proceeding from a root node R. It is worth mentioning here that path encoding is permitted using exclusively schema branching codes SBC and tree branching codes TBC, although, for example according to the definition in [1], what are termed position codes would otherwise have to be inserted.

FIG. 2B shows a tree-type data structure with a partial path TeilP which differs from an absolute path in not proceeding from a root node R. In the case of partial path encoding, the first node in the path is specified only by its type code in relation to a general base type, e.g. the ur-type. This means that, as shown in FIG. 1B, an absolute type code ATC is coded after the indication of the path type PT. The residual path can then be coded by indicating a relative path RP as shown in [1] and, where applicable, can be modified as in the first case. However, this encoding of the path is lossy, because only the data type of the first node can be determined, not its name. This is, however, of no relevance for many use cases.
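The lossy character of this encoding can be illustrated, by way of example only, with the following Python sketch. The element names, the type name PersonType, all code values, the two-bit value chosen for the path type, and the function name encode_lossy_partial_path are hypothetical and serve only to show that two differently named first nodes of the same data type yield the same code.

```python
# Assumed two-bit value signaling a lossy partial path.
PT_LOSSY_PARTIAL = "01"

# Hypothetical schema excerpt: element name -> data type of that element.
ELEMENT_TYPES = {"Author": "PersonType", "Editor": "PersonType"}

# Hypothetical absolute type codes ATC in relation to a general base type.
ABSOLUTE_TYPE_CODES = {"PersonType": "0101"}


def encode_lossy_partial_path(first_element: str, rp_bits: str) -> str:
    """Encode a partial path as PT + ATC + RP.

    Only the data type of the first node enters the code, so the name of
    the first node cannot be recovered from the result.
    """
    atc_bits = ABSOLUTE_TYPE_CODES[ELEMENT_TYPES[first_element]]
    return PT_LOSSY_PARTIAL + atc_bits + rp_bits


# Both first nodes have the same data type and therefore the same code.
code_author = encode_lossy_partial_path("Author", "110")
code_editor = encode_lossy_partial_path("Editor", "110")
```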

Loss-free encoding retaining the described properties can, however, be achieved by means of the encoding shown in FIG. 1C which, in addition to specifying the path type PT, the absolute type code ATC, and the relative path RP, also contains the number NT of types or child elements, followed by at least one absolute type AT or a tuple consisting of an absolute type AT and a schema branching code SBC of a child element. The number NT indicates the number of nodes which can contain the specified partial path proceeding from a child element. The type codes of these node types AT, AT′, . . . refer to the same basic type and are arranged, for example, in ascending order according to the codes. By specifying the schema branching codes SBC, . . . it is possible to signal specific child elements from which the partial path proceeds if several child elements of the type with the absolute type code ATC of the partial path TeilP have been declared.
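A minimal Python sketch of this field sequence is given below by way of example. The width of the count field (NT_BITS), the two-bit value chosen for the path type, the assumption that all type codes have the same width, and the names used are assumptions of the sketch; only the order of the fields and the ascending arrangement of the type codes follow the description above.

```python
from typing import List, Tuple

PT_LOSSLESS_PARTIAL = "10"  # assumed two-bit path type value
NT_BITS = 4                 # assumed width of the count field


def encode_lossless_partial_path(
    atc_bits: str,
    rp_bits: str,
    containers: List[Tuple[str, str]],
) -> str:
    """Encode PT + ATC + RP + count + (AT, SBC) entries.

    Each entry of `containers` is a pair (AT, SBC); SBC may be the empty
    string when only the absolute type AT is to be signaled.
    """
    if not 1 <= len(containers) < 2 ** NT_BITS:
        raise ValueError("number of types does not fit the count field")
    # Arrange the entries in ascending order of their type codes.
    ordered = sorted(containers, key=lambda entry: int(entry[0], 2))
    count_bits = format(len(ordered), "0{}b".format(NT_BITS))
    tail = "".join(at_bits + sbc_bits for at_bits, sbc_bits in ordered)
    return PT_LOSSLESS_PARTIAL + atc_bits + rp_bits + count_bits + tail


example_code = encode_lossless_partial_path(
    "0101", "110", [("0111", "01"), ("0011", "10")]
)
```

In the example call, the entry with type code 0011 is emitted before the entry with type code 0111, regardless of the order in which the entries were supplied.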

Encoding of the paths in an index by means of the method according to the invention is advantageous because often no decoding or only transcoding of documents transmitted in encoded form is required during indexing. The storage requirements for the index can also be reduced, which allows queries to be executed faster or which reduces the required computing effort. Encoding of the paths in database querying is advantageous because the volume of data transmitted from the device accepting

In an advantageous embodiment of the method according to the invention, textual paths for indexing elements and/or attributes are encoded in such a way that the data types which are instanced in the path and which are derived through polymorphism are uniquely replaced by standardized data types, each standardized data type being specified in a manner whereby, proceeding from the basic data type of the respective data type, a data type is searched which contains the element or attribute following in the path and which can be uniquely determined with reference to its derivation from the basic type. As a result, the encoded textual path is uniquely recognized by its bit pattern and the searched elements and/or attributes can be located with this in the encoded, XML-based document.
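One possible reading of this standardizing step is sketched below in Python, by way of example only. The sketch assumes that the standardized data type is the type with the lowest number of inheritance steps from the basic type that declares the element or attribute following in the path; the type hierarchy, the element names and the identifiers TypeDef, derivation_chain and standardize are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class TypeDef:
    name: str
    base: Optional[str]                              # None for the basic type
    declared: Set[str] = field(default_factory=set)  # children added by this type


def derivation_chain(types: Dict[str, TypeDef], name: str) -> List[str]:
    """Return the derivation of `name`, starting with the basic type."""
    chain: List[str] = []
    current: Optional[str] = name
    while current is not None:
        chain.append(current)
        current = types[current].base
    chain.reverse()
    return chain


def standardize(types: Dict[str, TypeDef], instanced: str, next_child: str) -> str:
    """Replace a derived type by the first type, proceeding from the basic
    type, that contains the element or attribute following in the path."""
    for candidate in derivation_chain(types, instanced):
        if next_child in types[candidate].declared:
            return candidate
    raise LookupError("{!r} is not contained in {!r}".format(next_child, instanced))


# Hypothetical hierarchy: ManagerType derives from EmployeeType,
# which derives from PersonType.
EXAMPLE_TYPES = {
    "PersonType": TypeDef("PersonType", None, {"Name"}),
    "EmployeeType": TypeDef("EmployeeType", "PersonType", {"Salary"}),
    "ManagerType": TypeDef("ManagerType", "EmployeeType", {"Reports"}),
}
```

With this hierarchy, a path step to Name is standardized to PersonType whether the node in the document is of type PersonType, EmployeeType or ManagerType, so that the step is encoded with the same type code in each case.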

This standardizing can be applied generally to textual paths, which is to say not just to textual paths for indexing but also to context paths, as described in [1], for encoding. The advantage of said standardizing is that identical textual paths of different documents are standardized to a single binary representation even if the nodes in the document which are contained in the path differ in their data type. This means that just a single bit pattern per path has to be taken into account during a search for textual paths with the aid of bit patterns of the encoded paths. A further advantage is that the resulting bit patterns are generally shorter than corresponding

The following references are cited in this document, and each of the listed references is incorporated by reference in its entirety herein:

-   [1] “ISO/IEC FCD 15938-1 Information technology - Multimedia content description interface: Systems”, /7, ISO/IEC JTC 1 SC29/WG11/N4001, Singapore, March 2001.
-   [2] XML Path Language, Version 1.0, W3C Recommendation, 16 Nov. 1999, http://www.w3.org/TR/xpath.
-   [3] dbXML-XML Database Application Server, Version 0.4, The dbXML Group, 2000, http://www.dbxml.org/docs/CoreSpecs.pdf.
-   [4] J. Robie, J. Lapp, D. Schach, XML Query Language (XQL), 1998, http://www.w3.org/TandS/QL/QL98/pp/xgl.html.
-   [5] XML Schema Language, XML Schema Part 1: Structures, §6, W3C Recommendation, 2 May 2001, http://www.w3.org/XML/Schema.

1. A method for searching of elements in binary representations of structured XML-based documents, comprising the steps of: encoding a textual path for indexing elements or attributes, wherein the path comprises data types; replacing at least one of the data types, that are instanced in the path and which are derived through polymorphism, with standardized data types, wherein the respective standardized data type is obtained by deriving a basic data type of the respective data type, and searching a data type to establish that the searched data type contains the element or attribute following in the path and that is uniquely determined with reference to its derivation from the basic data type; providing a unique identification for the textual path using a bit pattern after encoding, wherein the bit pattern includes the searched elements or attributes.
2. The method according to claim 1, wherein after the step of searching the data type to establish the searched data type contains the element or attribute following in the path and that is uniquely determined with reference to its derivation from the basic type, the standardized data type is further defined by having the lowest or highest type code or the lowest or highest number of inheritance steps proceeding from the basic type.
3. The method according to claim 1, wherein the textual path is encoded by specifying a path type and an encoded absolute path, without the use of position codes.
4. The method according to claim 1, wherein the textual path is encoded by specifying a path type, an absolute type code, and an encoded relative path, without the use of position codes.
5. The method according to claim 4, wherein the textual path is encoded by additionally specifying a number of types and by means of a number of tuples, determined by the number of types from a respective absolute type and a respective schema branching code.
6. A method for filtering fragments in binary representations of structured XML-based documents, comprising the steps of: encoding a context path for indexing elements or attributes wherein the path comprises data types; replacing at least one of the data types that are instanced in the path and which are derived through polymorphism with standardized data types, wherein the respective standardized data type is obtained by deriving a basic data type of the respective data type, and searching a data type to establish that the searched data type contains the element or attribute following in the path and that is uniquely determined with reference to its derivation from the basic data type; providing a unique identification for the context path using a bit pattern after encoding, wherein the bit pattern includes the searched elements or attributes.